This dataset contains:
The main questions to be answered for this project are:
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime
from collections import Counter
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go
import plotly.express as px
con = sqlite3.connect("./database.sqlite")
def read_table(table_name, con):
return pd.read_sql_query(f"SELECT * FROM {table_name}", con)
"""
This function removes from matches dataframe those columns that have null values. It also dismiss the columns
'id' and 'match_api_id' because do not contribute with valuable information.
"""
def remove_unused_columns_matches(matches):
# print("Initial shape of matches dataframe: ", matches.shape)
count_nulls = matches.isnull().sum()
columns_to_remove = count_nulls[count_nulls > 0]
matches.drop(columns=columns_to_remove.index, inplace=True)
matches.drop(columns=["id", "match_api_id"], inplace=True)
# Check the shape is the correct one
assert matches.shape == (25979, 9)
"""
Function that casts the column 'date' from object type to datetime type.
"""
def convert_date_format(matches):
matches["date"] = matches["date"].astype("datetime64[ns]")
assert matches.dtypes["date"] == np.dtype("<M8[ns]")
"""
Function that adds the column 'country' to matches dataframe.
"""
def add_country_to_matches(matches, countries):
# Transform column 'country_id' to 'country' using the name of the countries
countries_dict = dict([(id, name) for id, name in zip(countries["id"], countries["name"])])
matches["country"] = matches["country_id"].map(countries_dict)
matches.drop(columns=["country_id"], inplace=True)
# Check the final dataframe has a column 'country' and does not have a column 'country_id'
assert matches.shape == (25979, 9)
assert "country" in matches.columns
assert "country_id" not in matches.columns
"""
Function that replaces the league IDs with league's names.
"""
def add_league_column_to_matches(matches, leagues):
leagues_dict = dict([id, name] for id, name in zip(leagues["id"], leagues["name"]))
matches["league"] = matches["league_id"].map(leagues_dict)
matches.drop(columns=["league_id"], inplace=True)
# Check the final dataframe has a column 'league' and does not have a column 'league_id'
assert matches.shape == (25979, 9)
assert "league" in matches.columns
assert "league_id" not in matches.columns
"""
Replaces the teams' IDs with the teams' names.
"""
def add_team_names_to_matches(matches, teams):
teams_dict = dict([id, name] for id, name in zip(teams["team_api_id"], teams["team_long_name"]))
matches["home_team"] = matches["home_team_api_id"].map(teams_dict)
matches["away_team"] = matches["away_team_api_id"].map(teams_dict)
matches.drop(columns=["home_team_api_id", "away_team_api_id"], inplace=True)
# Check the final dataframe has two new columns 'home_team' and 'away_team'
assert matches.shape == (25979, 9)
assert {"home_team", "away_team"}.issubset(matches.columns)
assert not {"home_team_api_id", "away_team_api_id"}.issubset(matches.columns)
The following function is a wrangling and cleaning function applied to matches dataframe. Its purpose is to prepare matches dataframe for the analysis.
def data_wrangling_matches(matches, countries, leagues, teams):
remove_unused_columns_matches(matches)
convert_date_format(matches)
add_country_to_matches(matches, countries)
add_league_column_to_matches(matches, leagues)
add_team_names_to_matches(matches, teams)
These are the queries to read theee tables as we have them stored in the sqlite database.
countries = read_table(table_name="Country", con=con)
leagues = read_table(table_name="League", con=con)
matches = read_table(table_name="Match", con=con)
teams = read_table(table_name="Team", con=con)
First, data_wrangling_matches function defined two cells above is applied to the matches dataframe, and here is the result of that process.
data_wrangling_matches(matches, countries, leagues, teams)
matches
| season | stage | date | home_team_goal | away_team_goal | country | league | home_team | away_team | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008/2009 | 1 | 2008-08-17 | 1 | 1 | Belgium | Belgium Jupiler League | KRC Genk | Beerschot AC |
| 1 | 2008/2009 | 1 | 2008-08-16 | 0 | 0 | Belgium | Belgium Jupiler League | SV Zulte-Waregem | Sporting Lokeren |
| 2 | 2008/2009 | 1 | 2008-08-16 | 0 | 3 | Belgium | Belgium Jupiler League | KSV Cercle Brugge | RSC Anderlecht |
| 3 | 2008/2009 | 1 | 2008-08-17 | 5 | 0 | Belgium | Belgium Jupiler League | KAA Gent | RAEC Mons |
| 4 | 2008/2009 | 1 | 2008-08-16 | 1 | 3 | Belgium | Belgium Jupiler League | FCV Dender EH | Standard de Liège |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 2015/2016 | 9 | 2015-09-22 | 1 | 0 | Switzerland | Switzerland Super League | FC St. Gallen | FC Thun |
| 25975 | 2015/2016 | 9 | 2015-09-23 | 1 | 2 | Switzerland | Switzerland Super League | FC Vaduz | FC Luzern |
| 25976 | 2015/2016 | 9 | 2015-09-23 | 2 | 0 | Switzerland | Switzerland Super League | Grasshopper Club Zürich | FC Sion |
| 25977 | 2015/2016 | 9 | 2015-09-22 | 0 | 0 | Switzerland | Switzerland Super League | Lugano | FC Zürich |
| 25978 | 2015/2016 | 9 | 2015-09-23 | 4 | 3 | Switzerland | Switzerland Super League | BSC Young Boys | FC Basel |
25979 rows × 9 columns
To verify the non-nulls values of the dataframe it is applied the info() function.
matches.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25979 entries, 0 to 25978 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 season 25979 non-null object 1 stage 25979 non-null int64 2 date 25979 non-null datetime64[ns] 3 home_team_goal 25979 non-null int64 4 away_team_goal 25979 non-null int64 5 country 25979 non-null object 6 league 25979 non-null object 7 home_team 25979 non-null object 8 away_team 25979 non-null object dtypes: datetime64[ns](1), int64(3), object(5) memory usage: 1.8+ MB
Once everything is cleaned and ordered, it is time to start answering the first question: 'What teams improved the most over the time period?'
Before we answer this question we can plot the total goals scored per season per team.
goals_for = pd.concat(
[
matches[["season", "home_team", "home_team_goal"]].rename(columns={"home_team": "team", "home_team_goal": "for"}),
matches[["season", "away_team", "away_team_goal"]].rename(columns={"away_team": "team", "away_team_goal": "for"}),
]
)
goals_for = goals_for.groupby(["season", "team"]).sum()
goals_against = pd.concat(
[
matches[["season", "home_team", "away_team_goal"]].rename(columns={"home_team": "team", "away_team_goal": "against"}),
matches[["season", "away_team", "home_team_goal"]].rename(columns={"away_team": "team", "home_team_goal": "against"}),
]
)
goals_against = goals_against.groupby(["season", "team"]).sum()
goals = pd.merge(
goals_for.reset_index(),
goals_against.reset_index(),
on=["season", "team"],
)
goals["gd"] = goals["for"] - goals["against"]
goals.sort_values(by=["season"], inplace=True, ignore_index=True)
goals
| season | team | for | against | gd | |
|---|---|---|---|---|---|
| 0 | 2008/2009 | 1. FC Köln | 35 | 50 | -15 |
| 1 | 2008/2009 | P. Warszawa | 40 | 23 | 17 |
| 2 | 2008/2009 | PSV | 71 | 33 | 38 |
| 3 | 2008/2009 | Palermo | 57 | 50 | 7 |
| 4 | 2008/2009 | Paris Saint-Germain | 49 | 38 | 11 |
| ... | ... | ... | ... | ... | ... |
| 1473 | 2015/2016 | Frosinone | 35 | 76 | -41 |
| 1474 | 2015/2016 | GFC Ajaccio | 37 | 58 | -21 |
| 1475 | 2015/2016 | Genoa | 45 | 48 | -3 |
| 1476 | 2015/2016 | FC Thun | 45 | 54 | -9 |
| 1477 | 2015/2016 | Śląsk Wrocław | 28 | 37 | -9 |
1478 rows × 5 columns
The next plot shows the total goals scored along the 8 seasons.
fig = go.Figure()
for team, group in goals.groupby(["team"]):
if group.shape[0] == 8:
group.sort_values(by=["season"], inplace=True)
fig.add_trace(go.Scatter(x=group["season"], y=group["for"], name=team, connectgaps=False))
fig.update_layout(
title="Total goals per team over 8 seasons",
xaxis_title="Seasons",
yaxis_title="Goals",
legend_title_text="Teams",
)
fig.show()
From this plot it is clear that teams like FC Barcelona and Real Madrid CF are the ones that lead the goal scoring of all clubs registered on this European Soccer database. However that does not imply that they have improved over time. So, it is necessary to find another way to represent improvement.
One proposal could be calculating the points each team gets after each match. In order to do that two extra columns are going to be added to the matches dataframe: home_team_points and away_team_points. They will have the following values: if a team won a game it gets three points, if it losses the game zero points, otherwise both teams get one point.
condlist = [
np.array(matches["home_team_goal"] > matches["away_team_goal"]).reshape(-1, 1),
np.array(matches["home_team_goal"] < matches["away_team_goal"]).reshape(-1, 1),
]
choicelist = [[3,0], [0, 3]]
points = pd.DataFrame(
np.select(condlist=condlist, choicelist=choicelist, default=[1, 1]),
columns=["home_team_points", "away_team_points"],
)
matches = pd.concat(
[
matches,
points,
],
axis=1,
)
matches
| season | stage | date | home_team_goal | away_team_goal | country | league | home_team | away_team | home_team_points | away_team_points | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008/2009 | 1 | 2008-08-17 | 1 | 1 | Belgium | Belgium Jupiler League | KRC Genk | Beerschot AC | 1 | 1 |
| 1 | 2008/2009 | 1 | 2008-08-16 | 0 | 0 | Belgium | Belgium Jupiler League | SV Zulte-Waregem | Sporting Lokeren | 1 | 1 |
| 2 | 2008/2009 | 1 | 2008-08-16 | 0 | 3 | Belgium | Belgium Jupiler League | KSV Cercle Brugge | RSC Anderlecht | 0 | 3 |
| 3 | 2008/2009 | 1 | 2008-08-17 | 5 | 0 | Belgium | Belgium Jupiler League | KAA Gent | RAEC Mons | 3 | 0 |
| 4 | 2008/2009 | 1 | 2008-08-16 | 1 | 3 | Belgium | Belgium Jupiler League | FCV Dender EH | Standard de Liège | 0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 2015/2016 | 9 | 2015-09-22 | 1 | 0 | Switzerland | Switzerland Super League | FC St. Gallen | FC Thun | 3 | 0 |
| 25975 | 2015/2016 | 9 | 2015-09-23 | 1 | 2 | Switzerland | Switzerland Super League | FC Vaduz | FC Luzern | 0 | 3 |
| 25976 | 2015/2016 | 9 | 2015-09-23 | 2 | 0 | Switzerland | Switzerland Super League | Grasshopper Club Zürich | FC Sion | 3 | 0 |
| 25977 | 2015/2016 | 9 | 2015-09-22 | 0 | 0 | Switzerland | Switzerland Super League | Lugano | FC Zürich | 1 | 1 |
| 25978 | 2015/2016 | 9 | 2015-09-23 | 4 | 3 | Switzerland | Switzerland Super League | BSC Young Boys | FC Basel | 3 | 0 |
25979 rows × 11 columns
Once the points per match were calculated, given how data was stored, we have to reorder it to just get the points per team.
points = pd.concat(
[
matches[["country", "season", "date", "home_team", "home_team_points"]].rename(columns={"home_team": "team", "home_team_points": "points"}),
matches[["country", "season", "date", "away_team", "away_team_points"]].rename(columns={"away_team": "team", "away_team_points": "points"}),
],
ignore_index=True,
)
points.sort_values(by=["team", "date"], ignore_index=True, inplace=True)
points
| country | season | date | team | points | |
|---|---|---|---|---|---|
| 0 | Germany | 2010/2011 | 2010-08-21 | 1. FC Kaiserslautern | 3 |
| 1 | Germany | 2010/2011 | 2010-08-27 | 1. FC Kaiserslautern | 3 |
| 2 | Germany | 2010/2011 | 2010-09-12 | 1. FC Kaiserslautern | 0 |
| 3 | Germany | 2010/2011 | 2010-09-18 | 1. FC Kaiserslautern | 1 |
| 4 | Germany | 2010/2011 | 2010-09-22 | 1. FC Kaiserslautern | 0 |
| ... | ... | ... | ... | ... | ... |
| 51953 | Poland | 2015/2016 | 2016-03-08 | Śląsk Wrocław | 0 |
| 51954 | Poland | 2015/2016 | 2016-03-11 | Śląsk Wrocław | 1 |
| 51955 | Poland | 2015/2016 | 2016-03-19 | Śląsk Wrocław | 1 |
| 51956 | Poland | 2015/2016 | 2016-04-01 | Śląsk Wrocław | 3 |
| 51957 | Poland | 2015/2016 | 2016-04-09 | Śląsk Wrocław | 3 |
51958 rows × 5 columns
Let's plot the points over time for the teams that scored more goals: FC Barcelona and Real Madrid CF.
barcelona = points.loc[points["team"] == "FC Barcelona", ["date", "points"]]
barcelona.sort_values(by=["date"], inplace=True, ignore_index=True)
real_madrid = points.loc[points["team"] == "Real Madrid CF", ["date", "points"]]
real_madrid.sort_values(by=["date"], inplace=True, ignore_index=True)
fig = go.Figure()
fig.add_trace(go.Scatter(x=barcelona.index, y=barcelona["points"].cumsum(), name="FC Barcelona"))
fig.add_trace(go.Scatter(x=real_madrid.index, y=real_madrid["points"].cumsum(), name="Real Madrid CF"))
fig.update_layout(
title="Points accumulated over all matches",
xaxis_title="matches",
yaxis_title="points",
)
fig.show()
This representation does not tell us clearly whether the teams are improving or not, because at least for this two teams, they are on top rankings all the time. So, another strategy could be to check how they are improving during a season, and once we have that number it's possible to calculate the general improvement over all seasons.
How to represent the improvement over time? It is possible to calculate the linear trend of the points that teams are getting on each match. The linear trend is going to be the linear regression model of the average of points gotten in three consecutive matches. More specifically, we'll use the regression coefficient as the slope of the linear regression model, so if the slope is positive, that means there is a performance improvement, otherwise the team performance is getting worse over time.
slopes = points[["country", "team"]].drop_duplicates().set_index("team")
groups = points.groupby(["season", "team"])
slopes = points[["country", "team"]].drop_duplicates().set_index("team")
for group_name, team_points in groups:
team_points = team_points.groupby(np.arange(len(team_points)) // 3).mean()
X = np.arange(len(team_points)).reshape(-1, 1)
y = team_points["points"].to_numpy()
regressor = LinearRegression()
regressor.fit(X, y)
slopes.loc[group_name[1], group_name[0]] = regressor.coef_
slopes.head(5)
| country | 2008/2009 | 2009/2010 | 2010/2011 | 2011/2012 | 2012/2013 | 2013/2014 | 2014/2015 | 2015/2016 | |
|---|---|---|---|---|---|---|---|---|---|
| team | |||||||||
| 1. FC Kaiserslautern | Germany | NaN | NaN | 0.144522 | -0.075758 | NaN | NaN | NaN | NaN |
| 1. FC Köln | Germany | -0.050117 | 0.002331 | 0.156177 | -0.114219 | NaN | NaN | -0.006993 | -0.038462 |
| 1. FC Nürnberg | Germany | NaN | 0.096737 | -0.045455 | -0.004662 | 0.086247 | -0.046620 | NaN | NaN |
| 1. FSV Mainz 05 | Germany | NaN | -0.036131 | -0.023310 | -0.019814 | -0.072261 | 0.057110 | -0.044289 | -0.025641 |
| AC Ajaccio | France | NaN | NaN | NaN | 0.075092 | -0.038462 | 0.029304 | NaN | NaN |
Before further analysis, the dataset shows some non-null values. That's because there are teams that did not play the 8 seasons. Let's remove them, because the analysis will be just for the teams that have the most information available.
slopes.dropna(inplace=True)
Let's define a function that is going to create the linear regression model over all seasons.
def calculate_slope(X, y):
regressor = LinearRegression()
regressor.fit(X, y)
return regressor.coef_[0]
Apply that function to all teams.
seasons = matches["season"].unique()
slopes["general_slope"] = slopes[seasons].apply(
lambda x: calculate_slope(np.arange(len(x)).reshape(-1, 1), x),
axis=1,
)
slopes.head(3)
| country | 2008/2009 | 2009/2010 | 2010/2011 | 2011/2012 | 2012/2013 | 2013/2014 | 2014/2015 | 2015/2016 | general_slope | |
|---|---|---|---|---|---|---|---|---|---|---|
| team | ||||||||||
| ADO Den Haag | Netherlands | 0.041958 | -0.030303 | -0.065268 | -0.076923 | -0.074592 | 0.150350 | -0.008159 | 0.029138 | 0.007978 |
| AS Saint-Étienne | France | 0.042125 | 0.020147 | -0.078755 | -0.051282 | 0.018315 | 0.060440 | 0.040293 | -0.032967 | 0.000741 |
| AZ | Netherlands | -0.002331 | 0.053613 | -0.033800 | -0.064103 | 0.001166 | -0.054779 | 0.081585 | 0.148019 | 0.014222 |
Perfect, what if we sort the slopes and get the top 5 teams with the most positive coefficients or slopes. This means that these teams have the best improvement over time.
slopes.sort_values(by=["general_slope"], ascending=False).head(5)
| country | 2008/2009 | 2009/2010 | 2010/2011 | 2011/2012 | 2012/2013 | 2013/2014 | 2014/2015 | 2015/2016 | general_slope | |
|---|---|---|---|---|---|---|---|---|---|---|
| team | ||||||||||
| KAA Gent | Belgium | 0.010490 | 0.030303 | -0.018182 | -0.044444 | -0.052525 | 1.333333e+00 | 0.094949 | -0.012121 | 0.050136 |
| Bayer 04 Leverkusen | Germany | -0.143357 | -0.143357 | 0.058275 | 0.051282 | 0.068765 | -5.244755e-02 | -0.015152 | 0.111888 | 0.025155 |
| Sunderland | England | -0.069597 | -0.029304 | -0.034799 | -0.012821 | -0.016484 | 9.340659e-02 | 0.012821 | 0.108059 | 0.021847 |
| TSG 1899 Hoffenheim | Germany | -0.047786 | -0.051282 | -0.113054 | -0.075758 | 0.110723 | 7.925408e-02 | -0.006993 | 0.066434 | 0.021243 |
| Lechia Gdańsk | Poland | -0.024242 | -0.115152 | -0.054545 | 0.018182 | -0.096970 | -1.222315e-17 | 0.026263 | 0.086869 | 0.018254 |
And the 5 teams with the lowest slope value. Which means they are decreasing their performance over time.
slopes.sort_values(by=["general_slope"]).head(5)
| country | 2008/2009 | 2009/2010 | 2010/2011 | 2011/2012 | 2012/2013 | 2013/2014 | 2014/2015 | 2015/2016 | general_slope | |
|---|---|---|---|---|---|---|---|---|---|---|
| team | ||||||||||
| FC Bayern Munich | Germany | 0.066434 | 0.062937 | 0.120047 | 0.031469 | 0.024476 | -0.027972 | -0.047786 | -0.023310 | -0.019439 |
| Ajax | Netherlands | 0.004662 | 0.089744 | 0.061772 | 0.135198 | 0.093240 | -0.010490 | -0.080420 | -0.053613 | -0.018065 |
| Atlético Madrid | Spain | 0.086081 | 0.018315 | 0.040293 | 0.071429 | -0.087912 | -0.087912 | -0.065934 | 0.012821 | -0.017595 |
| Everton | England | 0.067766 | 0.104396 | 0.080586 | 0.071429 | -0.012821 | -0.001832 | 0.034799 | -0.038462 | -0.016941 |
| Roma | Italy | 0.064103 | 0.128205 | 0.020147 | 0.031136 | 0.029304 | -0.102564 | -0.084249 | 0.076923 | -0.015982 |
Let's plot the improvement over all seasons for the teams that improved the most.
top_5 = slopes.sort_values(by=["general_slope"], ascending=False).head(3)
fig = go.Figure()
for team in top_5.index.to_list():
fig.add_trace(go.Scatter(x=seasons, y=slopes[seasons].loc[team], name=team))
fig.update_layout(
title="Teams with best improvement along all seasons",
xaxis_title="Seasons",
yaxis_title="Slopes",
legend_title="Teams",
font=dict(
family="Courier New",
size=14,
color="RebeccaPurple",
)
)
fig.show()
Now, plot the improvement for the teams that have the most negative slopes over all seasons.
last_5 = slopes.sort_values(by=["general_slope"]).head(3)
fig = go.Figure()
for team in last_5.index.to_list():
fig.add_trace(go.Scatter(x=seasons, y=slopes[seasons].loc[team], name=team))
fig.update_layout(
title="Teams with worst improvement along all seasons",
xaxis_title="Seasons",
yaxis_title="Slopes",
legend_title="Teams",
font=dict(
family="Courier New",
size=14,
color="RebeccaPurple",
)
)
fig.show()
The next query retrieves information from two tables Player and Player_Attributes, because the second table contains the penalties data.
players_attributes = pd.read_sql_query("""
SELECT
p.player_name
, p.birthday
, p.height
, p.weight
, pa.*
FROM Player as p
LEFT JOIN Player_Attributes pa on p.player_api_id = pa.player_api_id;
""",
con=con,
)
players_attributes
| player_name | birthday | height | weight | id | player_fifa_api_id | player_api_id | date | overall_rating | potential | ... | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 5 | 218353 | 505942 | 2007-02-22 00:00:00 | 61.0 | 65.0 | ... | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| 1 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 4 | 218353 | 505942 | 2015-03-20 00:00:00 | 61.0 | 65.0 | ... | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| 2 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 3 | 218353 | 505942 | 2015-09-21 00:00:00 | 62.0 | 66.0 | ... | 54.0 | 48.0 | 65.0 | 66.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 3 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 2 | 218353 | 505942 | 2015-11-19 00:00:00 | 67.0 | 71.0 | ... | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 4 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 1 | 218353 | 505942 | 2016-02-18 00:00:00 | 67.0 | 71.0 | ... | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183973 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183973 | 102359 | 39902 | 2010-08-30 00:00:00 | 83.0 | 85.0 | ... | 88.0 | 73.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183974 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183972 | 102359 | 39902 | 2011-02-22 00:00:00 | 81.0 | 85.0 | ... | 87.0 | 73.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183975 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183971 | 102359 | 39902 | 2011-08-30 00:00:00 | 81.0 | 81.0 | ... | 87.0 | 78.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183976 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183970 | 102359 | 39902 | 2012-02-22 00:00:00 | 81.0 | 81.0 | ... | 87.0 | 78.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183977 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183969 | 102359 | 39902 | 2012-08-31 00:00:00 | 78.0 | 78.0 | ... | 82.0 | 78.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
183978 rows × 46 columns
Let's check how many non-null values exist for that column.
players_attributes["penalties"].isnull().sum()
836
Let's remove those rows that contain non-null values.
players_attributes.dropna(subset=["penalties"], inplace=True)
players_attributes
| player_name | birthday | height | weight | id | player_fifa_api_id | player_api_id | date | overall_rating | potential | ... | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 5 | 218353 | 505942 | 2007-02-22 00:00:00 | 61.0 | 65.0 | ... | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| 1 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 4 | 218353 | 505942 | 2015-03-20 00:00:00 | 61.0 | 65.0 | ... | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| 2 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 3 | 218353 | 505942 | 2015-09-21 00:00:00 | 62.0 | 66.0 | ... | 54.0 | 48.0 | 65.0 | 66.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 3 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 2 | 218353 | 505942 | 2015-11-19 00:00:00 | 67.0 | 71.0 | ... | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 4 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 | 1 | 218353 | 505942 | 2016-02-18 00:00:00 | 67.0 | 71.0 | ... | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183973 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183973 | 102359 | 39902 | 2010-08-30 00:00:00 | 83.0 | 85.0 | ... | 88.0 | 73.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183974 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183972 | 102359 | 39902 | 2011-02-22 00:00:00 | 81.0 | 85.0 | ... | 87.0 | 73.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183975 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183971 | 102359 | 39902 | 2011-08-30 00:00:00 | 81.0 | 81.0 | ... | 87.0 | 78.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183976 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183970 | 102359 | 39902 | 2012-02-22 00:00:00 | 81.0 | 81.0 | ... | 87.0 | 78.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
| 183977 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 | 183969 | 102359 | 39902 | 2012-08-31 00:00:00 | 78.0 | 78.0 | ... | 82.0 | 78.0 | 22.0 | 31.0 | 30.0 | 9.0 | 13.0 | 8.0 | 15.0 | 5.0 |
183142 rows × 46 columns
Since there are multiple records for each player, the dataframe can be grouped by player and get the maximum penalties registered for each player. Then, it is possible to sort the results starting from the maximum number of penalties.
players_penalties = players_attributes.groupby("player_name")["penalties"].max()
players_penalties.sort_values(ascending=False)
player_name
Rickie Lambert 96.0
Andrea Pirlo 95.0
Xavi Hernandez 95.0
Mario Balotelli 95.0
Paul Scholes 95.0
...
Igor Stefanovic 11.0
Giedrius Arlauskis 11.0
Jakub Szumski 11.0
Timothy van der Meulen 10.0
Jakub Divis 9.0
Name: penalties, Length: 10848, dtype: float64
Rickie Lambert is the player with the most penalties registered.
First, it is important to mention that the tables involved to answer this question are Teams and Teams_Attributes from European Soccer database.
We've already read the Teams table, let's explore it and do the necessary cleaness process.
teams
| id | team_api_id | team_fifa_api_id | team_long_name | team_short_name | |
|---|---|---|---|---|---|
| 0 | 1 | 9987 | 673.0 | KRC Genk | GEN |
| 1 | 2 | 9993 | 675.0 | Beerschot AC | BAC |
| 2 | 3 | 10000 | 15005.0 | SV Zulte-Waregem | ZUL |
| 3 | 4 | 9994 | 2007.0 | Sporting Lokeren | LOK |
| 4 | 5 | 9984 | 1750.0 | KSV Cercle Brugge | CEB |
| ... | ... | ... | ... | ... | ... |
| 294 | 49479 | 10190 | 898.0 | FC St. Gallen | GAL |
| 295 | 49837 | 10191 | 1715.0 | FC Thun | THU |
| 296 | 50201 | 9777 | 324.0 | Servette FC | SER |
| 297 | 50204 | 7730 | 1862.0 | FC Lausanne-Sports | LAU |
| 298 | 51606 | 7896 | NaN | Lugano | LUG |
299 rows × 5 columns
Drop some columns from teams dataframe that are not necessary for our analysis.
teams.drop(columns=["id", "team_fifa_api_id", "team_short_name"], inplace=True)
teams
| team_api_id | team_long_name | |
|---|---|---|
| 0 | 9987 | KRC Genk |
| 1 | 9993 | Beerschot AC |
| 2 | 10000 | SV Zulte-Waregem |
| 3 | 9994 | Sporting Lokeren |
| 4 | 9984 | KSV Cercle Brugge |
| ... | ... | ... |
| 294 | 10190 | FC St. Gallen |
| 295 | 10191 | FC Thun |
| 296 | 9777 | Servette FC |
| 297 | 7730 | FC Lausanne-Sports |
| 298 | 7896 | Lugano |
299 rows × 2 columns
Check the NA values and uniqueness for this dataset
teams.isna().sum()
team_api_id 0 team_long_name 0 dtype: int64
With this we verify that we are working with a dataset with non-null values.
Let's check the uniqueness of the values. For this dataset it is important to verify the uniqueness since we are working with IDs and obviously all of them should be different in the first place.
teams.apply(lambda x: print(f"'{x.name}' is unique: {x.is_unique}"))
'team_api_id' is unique: True 'team_long_name' is unique: False
team_api_id None team_long_name None dtype: object
Given the results, it is possible to see the IDs are different but there are duplicated team's names, which means some teams were stored with different IDs. Let's check that out!
teams.loc[teams["team_long_name"].duplicated(), :]
| team_api_id | team_long_name | |
|---|---|---|
| 24 | 274581 | Royal Excel Mouscron |
| 183 | 8020 | Polonia Bytom |
| 199 | 8024 | Widzew Łódź |
It was identified which teams are duplicated. Let's analyze their data in the teams_attributes dataframe, because this data is related with teams dataframe through the team_api_id.
Let's get the data from Team_Attributes table and save it as a dataframe.
teams_attributes = pd.read_sql_query("SELECT * FROM Team_Attributes", con=con)
teams_attributes
| id | team_fifa_api_id | team_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribbling | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 434 | 9930 | 2010-02-22 00:00:00 | 60 | Balanced | NaN | Little | 50 | Mixed | ... | 55 | Normal | Organised | 50 | Medium | 55 | Press | 45 | Normal | Cover |
| 1 | 2 | 434 | 9930 | 2014-09-19 00:00:00 | 52 | Balanced | 48.0 | Normal | 56 | Mixed | ... | 64 | Normal | Organised | 47 | Medium | 44 | Press | 54 | Normal | Cover |
| 2 | 3 | 434 | 9930 | 2015-09-10 00:00:00 | 47 | Balanced | 41.0 | Normal | 54 | Mixed | ... | 64 | Normal | Organised | 47 | Medium | 44 | Press | 54 | Normal | Cover |
| 3 | 4 | 77 | 8485 | 2010-02-22 00:00:00 | 70 | Fast | NaN | Little | 70 | Long | ... | 70 | Lots | Organised | 60 | Medium | 70 | Double | 70 | Wide | Cover |
| 4 | 5 | 77 | 8485 | 2011-02-22 00:00:00 | 47 | Balanced | NaN | Little | 52 | Mixed | ... | 52 | Normal | Organised | 47 | Medium | 47 | Press | 52 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 1454 | 15005 | 10000 | 2011-02-22 00:00:00 | 52 | Balanced | NaN | Little | 52 | Mixed | ... | 53 | Normal | Organised | 46 | Medium | 48 | Press | 53 | Normal | Cover |
| 1454 | 1455 | 15005 | 10000 | 2012-02-22 00:00:00 | 54 | Balanced | NaN | Little | 51 | Mixed | ... | 50 | Normal | Organised | 44 | Medium | 55 | Press | 53 | Normal | Cover |
| 1455 | 1456 | 15005 | 10000 | 2013-09-20 00:00:00 | 54 | Balanced | NaN | Little | 51 | Mixed | ... | 32 | Little | Organised | 44 | Medium | 58 | Press | 37 | Normal | Cover |
| 1456 | 1457 | 15005 | 10000 | 2014-09-19 00:00:00 | 54 | Balanced | 42.0 | Normal | 51 | Mixed | ... | 32 | Little | Organised | 44 | Medium | 58 | Press | 37 | Normal | Cover |
| 1457 | 1458 | 15005 | 10000 | 2015-09-10 00:00:00 | 54 | Balanced | 42.0 | Normal | 51 | Mixed | ... | 32 | Little | Organised | 44 | Medium | 58 | Press | 37 | Normal | Cover |
1458 rows × 25 columns
Next, to more easily identify which team the attributes belong to, a column with the team's name will be added to teams_attributes dataframe.
teams_attributes = pd.merge(
teams,
teams_attributes,
on=["team_api_id"],
)
teams_attributes
| team_api_id | team_long_name | id | team_fifa_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribbling | buildUpPlayDribblingClass | buildUpPlayPassing | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9987 | KRC Genk | 485 | 673 | 2010-02-22 00:00:00 | 45 | Balanced | NaN | Little | 45 | ... | 60 | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover |
| 1 | 9987 | KRC Genk | 486 | 673 | 2011-02-22 00:00:00 | 66 | Balanced | NaN | Little | 52 | ... | 51 | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap |
| 2 | 9987 | KRC Genk | 487 | 673 | 2012-02-22 00:00:00 | 53 | Balanced | NaN | Little | 55 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 3 | 9987 | KRC Genk | 488 | 673 | 2013-09-20 00:00:00 | 58 | Balanced | NaN | Little | 38 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 4 | 9987 | KRC Genk | 489 | 673 | 2014-09-19 00:00:00 | 58 | Balanced | 52.0 | Normal | 38 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 9777 | Servette FC | 1156 | 324 | 2013-09-20 00:00:00 | 48 | Balanced | NaN | Little | 51 | ... | 45 | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover |
| 1454 | 7730 | FC Lausanne-Sports | 672 | 1862 | 2010-02-22 00:00:00 | 30 | Slow | NaN | Little | 60 | ... | 60 | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover |
| 1455 | 7730 | FC Lausanne-Sports | 673 | 1862 | 2012-02-22 00:00:00 | 37 | Balanced | NaN | Little | 49 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1456 | 7730 | FC Lausanne-Sports | 674 | 1862 | 2013-09-20 00:00:00 | 51 | Balanced | NaN | Little | 49 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1457 | 7730 | FC Lausanne-Sports | 675 | 1862 | 2014-09-19 00:00:00 | 37 | Balanced | 55.0 | Normal | 49 | ... | 52 | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover |
1458 rows × 26 columns
The column team_long_name will be renamed to team for easy management.
teams_attributes.rename(columns={"team_long_name": "team"}, inplace=True)
teams_attributes
| team_api_id | team | id | team_fifa_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribbling | buildUpPlayDribblingClass | buildUpPlayPassing | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9987 | KRC Genk | 485 | 673 | 2010-02-22 00:00:00 | 45 | Balanced | NaN | Little | 45 | ... | 60 | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover |
| 1 | 9987 | KRC Genk | 486 | 673 | 2011-02-22 00:00:00 | 66 | Balanced | NaN | Little | 52 | ... | 51 | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap |
| 2 | 9987 | KRC Genk | 487 | 673 | 2012-02-22 00:00:00 | 53 | Balanced | NaN | Little | 55 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 3 | 9987 | KRC Genk | 488 | 673 | 2013-09-20 00:00:00 | 58 | Balanced | NaN | Little | 38 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 4 | 9987 | KRC Genk | 489 | 673 | 2014-09-19 00:00:00 | 58 | Balanced | 52.0 | Normal | 38 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 9777 | Servette FC | 1156 | 324 | 2013-09-20 00:00:00 | 48 | Balanced | NaN | Little | 51 | ... | 45 | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover |
| 1454 | 7730 | FC Lausanne-Sports | 672 | 1862 | 2010-02-22 00:00:00 | 30 | Slow | NaN | Little | 60 | ... | 60 | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover |
| 1455 | 7730 | FC Lausanne-Sports | 673 | 1862 | 2012-02-22 00:00:00 | 37 | Balanced | NaN | Little | 49 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1456 | 7730 | FC Lausanne-Sports | 674 | 1862 | 2013-09-20 00:00:00 | 51 | Balanced | NaN | Little | 49 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1457 | 7730 | FC Lausanne-Sports | 675 | 1862 | 2014-09-19 00:00:00 | 37 | Balanced | 55.0 | Normal | 49 | ... | 52 | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover |
1458 rows × 26 columns
In order to make easy to analyze data, first, we can check the cleaness of the teams_attributes dataset.
The following columns like id and team_fifa_api_id will be removed because these columns do not contain relevant information.
teams_attributes.drop(columns=["id", "team_fifa_api_id"], inplace=True)
teams_attributes
| team_api_id | team | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribbling | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9987 | KRC Genk | 2010-02-22 00:00:00 | 45 | Balanced | NaN | Little | 45 | Mixed | Organised | ... | 60 | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover |
| 1 | 9987 | KRC Genk | 2011-02-22 00:00:00 | 66 | Balanced | NaN | Little | 52 | Mixed | Organised | ... | 51 | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap |
| 2 | 9987 | KRC Genk | 2012-02-22 00:00:00 | 53 | Balanced | NaN | Little | 55 | Mixed | Organised | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 3 | 9987 | KRC Genk | 2013-09-20 00:00:00 | 58 | Balanced | NaN | Little | 38 | Mixed | Organised | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 4 | 9987 | KRC Genk | 2014-09-19 00:00:00 | 58 | Balanced | 52.0 | Normal | 38 | Mixed | Organised | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 9777 | Servette FC | 2013-09-20 00:00:00 | 48 | Balanced | NaN | Little | 51 | Mixed | Organised | ... | 45 | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover |
| 1454 | 7730 | FC Lausanne-Sports | 2010-02-22 00:00:00 | 30 | Slow | NaN | Little | 60 | Mixed | Organised | ... | 60 | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover |
| 1455 | 7730 | FC Lausanne-Sports | 2012-02-22 00:00:00 | 37 | Balanced | NaN | Little | 49 | Mixed | Organised | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1456 | 7730 | FC Lausanne-Sports | 2013-09-20 00:00:00 | 51 | Balanced | NaN | Little | 49 | Mixed | Organised | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1457 | 7730 | FC Lausanne-Sports | 2014-09-19 00:00:00 | 37 | Balanced | 55.0 | Normal | 49 | Mixed | Organised | ... | 52 | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover |
1458 rows × 24 columns
Let's check the null values of each column.
teams_attributes.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1458 entries, 0 to 1457 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 team_api_id 1458 non-null int64 1 team 1458 non-null object 2 date 1458 non-null object 3 buildUpPlaySpeed 1458 non-null int64 4 buildUpPlaySpeedClass 1458 non-null object 5 buildUpPlayDribbling 489 non-null float64 6 buildUpPlayDribblingClass 1458 non-null object 7 buildUpPlayPassing 1458 non-null int64 8 buildUpPlayPassingClass 1458 non-null object 9 buildUpPlayPositioningClass 1458 non-null object 10 chanceCreationPassing 1458 non-null int64 11 chanceCreationPassingClass 1458 non-null object 12 chanceCreationCrossing 1458 non-null int64 13 chanceCreationCrossingClass 1458 non-null object 14 chanceCreationShooting 1458 non-null int64 15 chanceCreationShootingClass 1458 non-null object 16 chanceCreationPositioningClass 1458 non-null object 17 defencePressure 1458 non-null int64 18 defencePressureClass 1458 non-null object 19 defenceAggression 1458 non-null int64 20 defenceAggressionClass 1458 non-null object 21 defenceTeamWidth 1458 non-null int64 22 defenceTeamWidthClass 1458 non-null object 23 defenceDefenderLineClass 1458 non-null object dtypes: float64(1), int64(9), object(14) memory usage: 284.8+ KB
After displaying teams_attributes dataframe info it is possible to see that column buildUpPlayDribbling has just 489 non-null values, so there is not enough information in that particular column and it will be dismissed in the analysis.
teams_attributes.drop(columns=["buildUpPlayDribbling"], inplace=True)
teams_attributes.shape
(1458, 23)
teams_attributes
| team_api_id | team | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | chanceCreationPassing | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9987 | KRC Genk | 2010-02-22 00:00:00 | 45 | Balanced | Little | 45 | Mixed | Organised | 50 | ... | 60 | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover |
| 1 | 9987 | KRC Genk | 2011-02-22 00:00:00 | 66 | Balanced | Little | 52 | Mixed | Organised | 65 | ... | 51 | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap |
| 2 | 9987 | KRC Genk | 2012-02-22 00:00:00 | 53 | Balanced | Little | 55 | Mixed | Organised | 55 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 3 | 9987 | KRC Genk | 2013-09-20 00:00:00 | 58 | Balanced | Little | 38 | Mixed | Organised | 67 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 4 | 9987 | KRC Genk | 2014-09-19 00:00:00 | 58 | Balanced | Normal | 38 | Mixed | Organised | 67 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 9777 | Servette FC | 2013-09-20 00:00:00 | 48 | Balanced | Little | 51 | Mixed | Organised | 53 | ... | 45 | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover |
| 1454 | 7730 | FC Lausanne-Sports | 2010-02-22 00:00:00 | 30 | Slow | Little | 60 | Mixed | Organised | 40 | ... | 60 | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover |
| 1455 | 7730 | FC Lausanne-Sports | 2012-02-22 00:00:00 | 37 | Balanced | Little | 49 | Mixed | Organised | 52 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1456 | 7730 | FC Lausanne-Sports | 2013-09-20 00:00:00 | 51 | Balanced | Little | 49 | Mixed | Organised | 52 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1457 | 7730 | FC Lausanne-Sports | 2014-09-19 00:00:00 | 37 | Balanced | Normal | 49 | Mixed | Organised | 52 | ... | 52 | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover |
1458 rows × 23 columns
Now, go back to the analysis of the duplicated teams' names. The first team found duplicated was 'Royal Excel Mouscron'.
teams_attributes[teams_attributes["team"] == "Royal Excel Mouscron"]
| team_api_id | team | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | chanceCreationPassing | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 68 | 9996 | Royal Excel Mouscron | 2015-09-10 00:00:00 | 50 | Balanced | Normal | 50 | Mixed | Organised | 50 | ... | 50 | Normal | Organised | 45 | Medium | 45 | Press | 50 | Normal | Cover |
| 69 | 9996 | Royal Excel Mouscron | 2015-09-10 00:00:00 | 50 | Balanced | Normal | 50 | Mixed | Organised | 50 | ... | 50 | Normal | Organised | 45 | Medium | 45 | Press | 50 | Normal | Cover |
| 96 | 274581 | Royal Excel Mouscron | 2015-09-10 00:00:00 | 50 | Balanced | Normal | 50 | Mixed | Organised | 50 | ... | 50 | Normal | Organised | 45 | Medium | 45 | Press | 50 | Normal | Cover |
3 rows × 23 columns
Let's verify if the whole three rows have the same values except witht the column team_api_id:
teams_attributes \
.loc[:, teams_attributes.columns[1:]] \
.loc[teams_attributes["team"] == "Royal Excel Mouscron"] \
.duplicated()
68 False 69 True 96 True dtype: bool
Let's remove the rows that are duplicated except for the column team_api_id.
teams_attributes.drop_duplicates(subset=teams_attributes.columns[1:], inplace=True)
teams_attributes
| team_api_id | team | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | chanceCreationPassing | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9987 | KRC Genk | 2010-02-22 00:00:00 | 45 | Balanced | Little | 45 | Mixed | Organised | 50 | ... | 60 | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover |
| 1 | 9987 | KRC Genk | 2011-02-22 00:00:00 | 66 | Balanced | Little | 52 | Mixed | Organised | 65 | ... | 51 | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap |
| 2 | 9987 | KRC Genk | 2012-02-22 00:00:00 | 53 | Balanced | Little | 55 | Mixed | Organised | 55 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 3 | 9987 | KRC Genk | 2013-09-20 00:00:00 | 58 | Balanced | Little | 38 | Mixed | Organised | 67 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 4 | 9987 | KRC Genk | 2014-09-19 00:00:00 | 58 | Balanced | Normal | 38 | Mixed | Organised | 67 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 9777 | Servette FC | 2013-09-20 00:00:00 | 48 | Balanced | Little | 51 | Mixed | Organised | 53 | ... | 45 | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover |
| 1454 | 7730 | FC Lausanne-Sports | 2010-02-22 00:00:00 | 30 | Slow | Little | 60 | Mixed | Organised | 40 | ... | 60 | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover |
| 1455 | 7730 | FC Lausanne-Sports | 2012-02-22 00:00:00 | 37 | Balanced | Little | 49 | Mixed | Organised | 52 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1456 | 7730 | FC Lausanne-Sports | 2013-09-20 00:00:00 | 51 | Balanced | Little | 49 | Mixed | Organised | 52 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1457 | 7730 | FC Lausanne-Sports | 2014-09-19 00:00:00 | 37 | Balanced | Normal | 49 | Mixed | Organised | 52 | ... | 52 | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover |
1450 rows × 23 columns
Finally, it is verified that there are the same unique teams' IDs and teams' names.
teams_attributes["team_api_id"].nunique()
285
teams_attributes["team"].nunique()
285
teams_attributes["date"] = pd.to_datetime(teams_attributes["date"], format="%Y-%m-%d")
teams_attributes.dtypes
team_api_id int64 team object date datetime64[ns] buildUpPlaySpeed int64 buildUpPlaySpeedClass object buildUpPlayDribblingClass object buildUpPlayPassing int64 buildUpPlayPassingClass object buildUpPlayPositioningClass object chanceCreationPassing int64 chanceCreationPassingClass object chanceCreationCrossing int64 chanceCreationCrossingClass object chanceCreationShooting int64 chanceCreationShootingClass object chanceCreationPositioningClass object defencePressure int64 defencePressureClass object defenceAggression int64 defenceAggressionClass object defenceTeamWidth int64 defenceTeamWidthClass object defenceDefenderLineClass object dtype: object
For what period of time there are attributes of teams?
teams_attributes["date"].min(), teams_attributes["date"].max()
(Timestamp('2010-02-22 00:00:00'), Timestamp('2015-09-10 00:00:00'))
Let's explore a little bit the teams_attribute dataset.
Looking at the dataset it is possible to see there are different observations related to one team, so let's check what is the average of observations per team.
teams_attributes.groupby("team")["date"].count().mean()
5.087719298245614
Now, what's the minimum and the maximum of observations.
teams_attributes.groupby("team")["date"].count().min()
1
teams_attributes.groupby("team")["date"].count().max()
6
The next plot represents at which time the observations were recorded.
fig = go.Figure(data=go.Scatter(x=teams_attributes["date"], y=teams_attributes.index, mode="markers"))
fig.update_layout(
title="Dates where the teams attributes were registered",
xaxis_title="Date",
yaxis_title="Observations IDs"
)
fig.show()
As it is shown, it is possible to relate the observations with the seasons played and then relate those attributes with the results in the given matches of the season.
To do that, it's necessary to know when the seasons started and finished per league/country.
seasons = matches.groupby(["season", "country"])["date"].agg(["min", "max"]).reset_index()
seasons
| season | country | min | max | |
|---|---|---|---|---|
| 0 | 2008/2009 | Belgium | 2008-08-16 | 2009-05-16 |
| 1 | 2008/2009 | England | 2008-08-16 | 2009-05-24 |
| 2 | 2008/2009 | France | 2008-08-09 | 2009-05-30 |
| 3 | 2008/2009 | Germany | 2008-08-15 | 2009-05-23 |
| 4 | 2008/2009 | Italy | 2008-08-30 | 2009-05-31 |
| ... | ... | ... | ... | ... |
| 83 | 2015/2016 | Poland | 2015-07-17 | 2016-04-09 |
| 84 | 2015/2016 | Portugal | 2015-08-14 | 2016-05-15 |
| 85 | 2015/2016 | Scotland | 2015-08-01 | 2016-05-15 |
| 86 | 2015/2016 | Spain | 2015-08-21 | 2016-05-15 |
| 87 | 2015/2016 | Switzerland | 2015-07-18 | 2016-05-25 |
88 rows × 4 columns
Then, it is necessary to relate each observation to the appropriate seasons. That's for the next function.
def get_season(country, date):
mask = (seasons["min"] <= date) & (date <= seasons["max"]) & (seasons["country"] == country)
season = seasons.loc[mask, "season"]
return season.iloc[0] if len(season) else np.nan
teams_attributes
| team_api_id | team | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | chanceCreationPassing | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9987 | KRC Genk | 2010-02-22 | 45 | Balanced | Little | 45 | Mixed | Organised | 50 | ... | 60 | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover |
| 1 | 9987 | KRC Genk | 2011-02-22 | 66 | Balanced | Little | 52 | Mixed | Organised | 65 | ... | 51 | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap |
| 2 | 9987 | KRC Genk | 2012-02-22 | 53 | Balanced | Little | 55 | Mixed | Organised | 55 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 3 | 9987 | KRC Genk | 2013-09-20 | 58 | Balanced | Little | 38 | Mixed | Organised | 67 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 4 | 9987 | KRC Genk | 2014-09-19 | 58 | Balanced | Normal | 38 | Mixed | Organised | 67 | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 9777 | Servette FC | 2013-09-20 | 48 | Balanced | Little | 51 | Mixed | Organised | 53 | ... | 45 | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover |
| 1454 | 7730 | FC Lausanne-Sports | 2010-02-22 | 30 | Slow | Little | 60 | Mixed | Organised | 40 | ... | 60 | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover |
| 1455 | 7730 | FC Lausanne-Sports | 2012-02-22 | 37 | Balanced | Little | 49 | Mixed | Organised | 52 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1456 | 7730 | FC Lausanne-Sports | 2013-09-20 | 51 | Balanced | Little | 49 | Mixed | Organised | 52 | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1457 | 7730 | FC Lausanne-Sports | 2014-09-19 | 37 | Balanced | Normal | 49 | Mixed | Organised | 52 | ... | 52 | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover |
1450 rows × 23 columns
In the seasons dataframe (with the information of the start and the end) there are two important pieces of information: the country and the season. On the other hand, in the team_attributes dataframe there is just the date of the observation and team associated with, so it's necessary to add the country to match the information from both datasets.
countries_teams = matches[["country", "home_team"]].drop_duplicates().rename(columns={"home_team": "team"})
teams_attributes = pd.merge(
countries_teams,
teams_attributes,
on=["team"],
)
teams_attributes
| country | team | team_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Belgium | KRC Genk | 9987 | 2010-02-22 | 45 | Balanced | Little | 45 | Mixed | Organised | ... | 60 | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover |
| 1 | Belgium | KRC Genk | 9987 | 2011-02-22 | 66 | Balanced | Little | 52 | Mixed | Organised | ... | 51 | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap |
| 2 | Belgium | KRC Genk | 9987 | 2012-02-22 | 53 | Balanced | Little | 55 | Mixed | Organised | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 3 | Belgium | KRC Genk | 9987 | 2013-09-20 | 58 | Balanced | Little | 38 | Mixed | Organised | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| 4 | Belgium | KRC Genk | 9987 | 2014-09-19 | 58 | Balanced | Normal | 38 | Mixed | Organised | ... | 56 | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1445 | Switzerland | Servette FC | 9777 | 2013-09-20 | 48 | Balanced | Little | 51 | Mixed | Organised | ... | 45 | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover |
| 1446 | Switzerland | FC Lausanne-Sports | 7730 | 2010-02-22 | 30 | Slow | Little | 60 | Mixed | Organised | ... | 60 | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover |
| 1447 | Switzerland | FC Lausanne-Sports | 7730 | 2012-02-22 | 37 | Balanced | Little | 49 | Mixed | Organised | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1448 | Switzerland | FC Lausanne-Sports | 7730 | 2013-09-20 | 51 | Balanced | Little | 49 | Mixed | Organised | ... | 48 | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover |
| 1449 | Switzerland | FC Lausanne-Sports | 7730 | 2014-09-19 | 37 | Balanced | Normal | 49 | Mixed | Organised | ... | 52 | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover |
1450 rows × 24 columns
Once there is the country column, it is possible to associate a season to an observation of attributes using the function defined few cells above.
teams_attributes["season"] = teams_attributes.apply(lambda x: get_season(x["country"], x["date"]), axis=1)
teams_attributes
| country | team | team_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | ... | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Belgium | KRC Genk | 9987 | 2010-02-22 | 45 | Balanced | Little | 45 | Mixed | Organised | ... | Normal | Organised | 70 | High | 65 | Press | 70 | Wide | Cover | 2009/2010 |
| 1 | Belgium | KRC Genk | 9987 | 2011-02-22 | 66 | Balanced | Little | 52 | Mixed | Organised | ... | Normal | Organised | 48 | Medium | 47 | Press | 54 | Normal | Offside Trap | 2010/2011 |
| 2 | Belgium | KRC Genk | 9987 | 2012-02-22 | 53 | Balanced | Little | 55 | Mixed | Organised | ... | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover | 2011/2012 |
| 3 | Belgium | KRC Genk | 9987 | 2013-09-20 | 58 | Balanced | Little | 38 | Mixed | Organised | ... | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover | NaN |
| 4 | Belgium | KRC Genk | 9987 | 2014-09-19 | 58 | Balanced | Normal | 38 | Mixed | Organised | ... | Normal | Organised | 47 | Medium | 45 | Press | 55 | Normal | Cover | 2014/2015 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1445 | Switzerland | Servette FC | 9777 | 2013-09-20 | 48 | Balanced | Little | 51 | Mixed | Organised | ... | Normal | Organised | 49 | Medium | 45 | Press | 49 | Normal | Cover | 2013/2014 |
| 1446 | Switzerland | FC Lausanne-Sports | 7730 | 2010-02-22 | 30 | Slow | Little | 60 | Mixed | Organised | ... | Normal | Organised | 55 | Medium | 60 | Press | 50 | Normal | Cover | 2009/2010 |
| 1447 | Switzerland | FC Lausanne-Sports | 7730 | 2012-02-22 | 37 | Balanced | Little | 49 | Mixed | Organised | ... | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover | 2011/2012 |
| 1448 | Switzerland | FC Lausanne-Sports | 7730 | 2013-09-20 | 51 | Balanced | Little | 49 | Mixed | Organised | ... | Normal | Organised | 43 | Medium | 43 | Press | 55 | Normal | Cover | 2013/2014 |
| 1449 | Switzerland | FC Lausanne-Sports | 7730 | 2014-09-19 | 37 | Balanced | Normal | 49 | Mixed | Organised | ... | Normal | Organised | 46 | Medium | 44 | Press | 55 | Normal | Cover | 2014/2015 |
1450 rows × 25 columns
We have worked with two datasets: teams and teams_attributes. In order to answer the question (to get the most relevant attributes that lead to the most victories), it's necessary to relate those attributes to the results of the matches. We already know which attributes are related to which seasons, now, let's relate those attributes to the results of the matches on those seasons.
The results of the matches are in matches dataframe.
statistics = pd.concat(
[
matches[["country", "season", "home_team", "home_team_points"]].rename(columns={"home_team": "team", "home_team_points": "points"}),
matches[["country", "season", "away_team", "away_team_points"]].rename(columns={"away_team": "team", "away_team_points": "points"}),
],
ignore_index=True,
)
statistics
| country | season | team | points | |
|---|---|---|---|---|
| 0 | Belgium | 2008/2009 | KRC Genk | 1 |
| 1 | Belgium | 2008/2009 | SV Zulte-Waregem | 1 |
| 2 | Belgium | 2008/2009 | KSV Cercle Brugge | 0 |
| 3 | Belgium | 2008/2009 | KAA Gent | 3 |
| 4 | Belgium | 2008/2009 | FCV Dender EH | 0 |
| ... | ... | ... | ... | ... |
| 51953 | Switzerland | 2015/2016 | FC Thun | 0 |
| 51954 | Switzerland | 2015/2016 | FC Luzern | 3 |
| 51955 | Switzerland | 2015/2016 | FC Sion | 0 |
| 51956 | Switzerland | 2015/2016 | FC Zürich | 1 |
| 51957 | Switzerland | 2015/2016 | FC Basel | 0 |
51958 rows × 4 columns
Remove the teams in statistics dataframe for which there are no attributes.
teams_with_attributes = teams_attributes["team"].unique()
statistics = statistics[statistics["team"].isin(teams_with_attributes)]
statistics
| country | season | team | points | |
|---|---|---|---|---|
| 0 | Belgium | 2008/2009 | KRC Genk | 1 |
| 1 | Belgium | 2008/2009 | SV Zulte-Waregem | 1 |
| 2 | Belgium | 2008/2009 | KSV Cercle Brugge | 0 |
| 3 | Belgium | 2008/2009 | KAA Gent | 3 |
| 5 | Belgium | 2008/2009 | KV Mechelen | 1 |
| ... | ... | ... | ... | ... |
| 51953 | Switzerland | 2015/2016 | FC Thun | 0 |
| 51954 | Switzerland | 2015/2016 | FC Luzern | 3 |
| 51955 | Switzerland | 2015/2016 | FC Sion | 0 |
| 51956 | Switzerland | 2015/2016 | FC Zürich | 1 |
| 51957 | Switzerland | 2015/2016 | FC Basel | 0 |
51602 rows × 4 columns
Verify there are the same teams in both dataframes.
set(statistics["team"].unique()) == set(teams_attributes["team"].unique())
True
statistics["points"].value_counts()
3 19304 0 19193 1 13105 Name: points, dtype: int64
statistics.replace(to_replace={"points": {3: "won", 1: "drawn", 0: "lost"}}, inplace=True)
statistics
/tmp/ipykernel_131367/418106478.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| country | season | team | points | |
|---|---|---|---|---|
| 0 | Belgium | 2008/2009 | KRC Genk | drawn |
| 1 | Belgium | 2008/2009 | SV Zulte-Waregem | drawn |
| 2 | Belgium | 2008/2009 | KSV Cercle Brugge | lost |
| 3 | Belgium | 2008/2009 | KAA Gent | won |
| 5 | Belgium | 2008/2009 | KV Mechelen | drawn |
| ... | ... | ... | ... | ... |
| 51953 | Switzerland | 2015/2016 | FC Thun | lost |
| 51954 | Switzerland | 2015/2016 | FC Luzern | won |
| 51955 | Switzerland | 2015/2016 | FC Sion | lost |
| 51956 | Switzerland | 2015/2016 | FC Zürich | drawn |
| 51957 | Switzerland | 2015/2016 | FC Basel | lost |
51602 rows × 4 columns
statistics.rename(columns={"points": "game_status"}, inplace=True)
statistics
/tmp/ipykernel_131367/495333763.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| country | season | team | game_status | |
|---|---|---|---|---|
| 0 | Belgium | 2008/2009 | KRC Genk | drawn |
| 1 | Belgium | 2008/2009 | SV Zulte-Waregem | drawn |
| 2 | Belgium | 2008/2009 | KSV Cercle Brugge | lost |
| 3 | Belgium | 2008/2009 | KAA Gent | won |
| 5 | Belgium | 2008/2009 | KV Mechelen | drawn |
| ... | ... | ... | ... | ... |
| 51953 | Switzerland | 2015/2016 | FC Thun | lost |
| 51954 | Switzerland | 2015/2016 | FC Luzern | won |
| 51955 | Switzerland | 2015/2016 | FC Sion | lost |
| 51956 | Switzerland | 2015/2016 | FC Zürich | drawn |
| 51957 | Switzerland | 2015/2016 | FC Basel | lost |
51602 rows × 4 columns
Our main goal is to identify which attributes are more relevant. The first strategy to solve that is to apply a model that can predict the result of a match, and then analyze the coefficients for each attribute (in the case of the model, the attributes are going to represent the features model) and check which ones have more relevance.
But before that, let's analyze the distribution of the categorical attributes in the three types of game status: won, drawn, lost. After that, numerical attributes will also be analyzed.
categorical_features = [
"buildUpPlaySpeedClass",
"buildUpPlayDribblingClass",
"buildUpPlayPassingClass",
"buildUpPlayPositioningClass",
"chanceCreationPassingClass",
"chanceCreationCrossingClass",
"chanceCreationShootingClass",
"chanceCreationPositioningClass",
"defencePressureClass",
"defenceAggressionClass",
"defenceTeamWidthClass",
"defenceDefenderLineClass",
]
numerical_features = [
"buildUpPlaySpeed",
"buildUpPlayPassing",
"chanceCreationPassing",
"chanceCreationCrossing",
"chanceCreationShooting",
"defencePressure",
"defenceAggression",
"defenceTeamWidth",
]
Join the attributes and the results.
data = pd.merge(
statistics,
teams_attributes,
on=["season", "country", "team"],
)
data
| country | season | team | game_status | team_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribblingClass | buildUpPlayPassing | ... | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Belgium | 2009/2010 | Standard de Liège | drawn | 9985 | 2010-02-22 | 53 | Balanced | Little | 40 | ... | 65 | Normal | Organised | 70 | High | 70 | Double | 65 | Normal | Cover |
| 1 | Belgium | 2009/2010 | Standard de Liège | won | 9985 | 2010-02-22 | 53 | Balanced | Little | 40 | ... | 65 | Normal | Organised | 70 | High | 70 | Double | 65 | Normal | Cover |
| 2 | Belgium | 2009/2010 | Standard de Liège | drawn | 9985 | 2010-02-22 | 53 | Balanced | Little | 40 | ... | 65 | Normal | Organised | 70 | High | 70 | Double | 65 | Normal | Cover |
| 3 | Belgium | 2009/2010 | Standard de Liège | won | 9985 | 2010-02-22 | 53 | Balanced | Little | 40 | ... | 65 | Normal | Organised | 70 | High | 70 | Double | 65 | Normal | Cover |
| 4 | Belgium | 2009/2010 | Standard de Liège | drawn | 9985 | 2010-02-22 | 53 | Balanced | Little | 40 | ... | 65 | Normal | Organised | 70 | High | 70 | Double | 65 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37515 | Switzerland | 2015/2016 | BSC Young Boys | lost | 10192 | 2015-09-10 | 52 | Balanced | Normal | 64 | ... | 46 | Normal | Organised | 44 | Medium | 34 | Press | 50 | Normal | Cover |
| 37516 | Switzerland | 2015/2016 | BSC Young Boys | won | 10192 | 2015-09-10 | 52 | Balanced | Normal | 64 | ... | 46 | Normal | Organised | 44 | Medium | 34 | Press | 50 | Normal | Cover |
| 37517 | Switzerland | 2015/2016 | BSC Young Boys | won | 10192 | 2015-09-10 | 52 | Balanced | Normal | 64 | ... | 46 | Normal | Organised | 44 | Medium | 34 | Press | 50 | Normal | Cover |
| 37518 | Switzerland | 2015/2016 | BSC Young Boys | won | 10192 | 2015-09-10 | 52 | Balanced | Normal | 64 | ... | 46 | Normal | Organised | 44 | Medium | 34 | Press | 50 | Normal | Cover |
| 37519 | Switzerland | 2015/2016 | BSC Young Boys | lost | 10192 | 2015-09-10 | 52 | Balanced | Normal | 64 | ... | 46 | Normal | Organised | 44 | Medium | 34 | Press | 50 | Normal | Cover |
37520 rows × 26 columns
The following plots are from the categorical variables and represent the proportion of the different categories of the attributes grouped by the result of the matches.
for feature in categorical_features:
groups = data.groupby(feature, sort=False)
figure_data = list()
for group_name, values in groups:
counts = values["game_status"].value_counts(normalize=True) * 100.0
figure_data += [go.Bar(name=group_name, x=counts.index, y=counts, texttemplate="%{y:.1f}", textposition="outside")]
fig = go.Figure(data=figure_data)
fig.update_layout(
title=feature,
yaxis_title="percentage",
barmode="group",
)
fig.show()
It is possible to see that there's no clear significant difference between the attributes and the results obtained.
Let's see what happens with numerical variables using histograms of the value's distribution.
for feature in numerical_features:
fig = go.Figure()
groups = data.groupby("game_status")
for group_name, values in groups:
fig.add_trace(go.Histogram(x=values[feature], name=group_name))
fig.update_layout(
barmode="overlay",
title_text=feature,
xaxis_title_text=feature,
yaxis_title_text="counts",
)
fig.update_traces(opacity=0.75)
fig.show()
The distribution of the numerical variables are pretty similar, therefore it is difficult to notice a significant difference between the 3 different outputs of matches.
Recalling the proposal of creating a model to predict the output of the matches, we can categorize the output as won, drawn and lost (because we are interested in the victories). A model that can be useful is Logistic Regression. This is a model used for classification problems, and given that we want to classify whether a team won, drew or lost a match, this is a good approach. Another thing is that once this model is trained, it is possible to analyze the regression coefficients in order to get the relevance of each attribute in the prediction.
The numerical features are used to train the model and the object variable is game_status
X = data.loc[:, numerical_features]
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = data["game_status"]
model = LogisticRegression(multi_class="ovr")
model.fit(X,y)
LogisticRegression(multi_class='ovr')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LogisticRegression(multi_class='ovr')
Once the model is fitted with the numerical features, we can check how well it predicts the status of the matches.
predictions = model.predict(X)
Counter(predictions)
Counter({'won': 19792, 'lost': 17728})
The predictions only return two out of three classes, and that is not a good sign because it is possible to see a lack of precision in the model.
Let's check the score which is the mean accuracy of the data.
model.score(X, y)
0.41409914712153517
This score is better than a model that randomly assigns a class, because for that case we could get an accuracy of 0.333. Nevertheless, the model is just predicting two classes which is not good.
Let's check the regression coefficients.
The classes have the following order:
model.classes_
array(['drawn', 'lost', 'won'], dtype=object)
The coefficients for class drawn are:
model.coef_[0]
array([ 0.01834855, 0.01630169, -0.01604067, -0.01510765, 0.00953911,
-0.04469597, 0.00571335, 0.01451332])
The coefficients for class lost are:
model.coef_[1]
array([-0.0294302 , 0.12081135, -0.01215875, -0.06227151, -0.02938927,
-0.09400256, -0.01550754, 0.01593723])
The coefficients for class won are:
model.coef_[2]
array([ 0.01394338, -0.13328622, 0.02334461, 0.07441526, 0.02213145,
0.12898247, 0.01048978, -0.02713506])
The following are the coefficients converted to probabilities, and we can see that almost all are roughly the same for the three classes.
[np.exp(x)/(1+np.exp(x)) for x in model.coef_[0]]
[0.5045870089957382, 0.504075333013677, 0.4959899189333752, 0.4962231599566076, 0.5023847604231396, 0.48882786827775615, 0.5014283343399757, 0.5036282662412285]
[np.exp(x)/(1+np.exp(x)) for x in model.coef_[1]]
[0.49264298046337285, 0.5301661561357174, 0.49696034984685244, 0.4844371515151573, 0.4926532116311339, 0.4765166498049305, 0.4961231917436247, 0.5039842224328268]
[np.exp(x)/(1+np.exp(x)) for x in model.coef_[2]]
[0.5034857875762043, 0.4667276879921658, 0.5058358867192281, 0.5185952346280284, 0.5055326355390368, 0.5322009863934838, 0.5026224203358294, 0.49321665033619405]
From the European Soccer database I can conclude that even though there are teams at the top of everyone else during all seasons, that doesn't mean they are the teams that improved the most over time.
On the contrary, the teams that improved the most are KAA Gent from Belgium Jupiler League, Bayer 04 Leverkusen from Bundesliga in Germany and Sunderland from the England Premier League. And one thing that's so interesting is that the teams I found decreased their improvement the most over time are FC Bayern Munich from Germany, Ajax from Netherlands and Atlético de Madrid from Spain even though they are top clubs around the world.
Regarding the attributes of the teams that could lead to the most victories, unfortunately, it was not possible to discover something relevant on that part. The attributes are evenly distributed over all the clubs and are not as decisive as would have been expected.
One limitation that should be considered for this particular analysis is that, for the table matches, there is no data related to the players who participated in each game. For that reason, it became difficult to predict which players and what attributes contributed to the most victories. Likewise, another relevant limitation is the absence of metadata which led to the reduced range of analysis since it was hard to interpret the data itself and therefore it was hard to know the contribution and interaction between them.
The characteristics of the team attributes along with the two limitations explained before undermined a deeper and more meaningful analysis intended to, for example, a decision-making process associated with improving performance as a team.